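library(tidyverse)  # dplyr, tidyr and ggplot2 are used below
library(plotly)     # interactive histogram at the end of this section
videogames.df <- read.csv(file.path(project.dir, dataset.dir, 'vgsales-12-4-2019.csv'))
colnames(videogames.df)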
## [1] "Rank" "Name" "basename" "Genre"
## [5] "ESRB_Rating" "Platform" "Publisher" "Developer"
## [9] "VGChartz_Score" "Critic_Score" "User_Score" "Total_Shipped"
## [13] "Global_Sales" "NA_Sales" "PAL_Sales" "JP_Sales"
## [17] "Other_Sales" "Year" "Last_Update" "url"
## [21] "status" "Vgchartzscore" "img_url"
Because the data were collected in April 2019, sales for that year are incomplete, so we exclude games with Year = 2019 to avoid giving a misleading picture of 2019 sales.
videogames.clean <- videogames.df %>% filter(Year < 2019)
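One side effect of this filter is worth keeping in mind: dplyr's filter() keeps only rows where the condition is TRUE, so games with a missing Year are dropped together with the 2019 releases. The sketch below illustrates this on a made-up data frame (toy and its values are hypothetical, not taken from the dataset).

```r
library(dplyr)

# Hypothetical toy data: one 2019 game and one game with a missing year
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Year = c(2017, 2018, 2019, NA)
)

# filter() drops rows where the condition is FALSE or NA
kept <- toy %>% filter(Year < 2019)
kept$Name

# To keep games with an unknown year, the NA case must be allowed explicitly
kept_with_na <- toy %>% filter(is.na(Year) | Year < 2019)
kept_with_na$Name
```

Whether games with an unknown year should stay in the analysis is a judgment call; the second form is only needed if they should.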
We want to compare sales across regions, so it is convenient to reshape the data into long format: one “Region” column and a corresponding “Sales” column holding sales in USD (millions).
vs_byregion <- videogames.clean %>% gather(Region, Sales, Global_Sales:Other_Sales, na.rm = TRUE)
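To make the reshape concrete, here is a minimal sketch on a made-up two-game table. It uses pivot_longer(), which tidyr documents as the replacement for the superseded gather(). The wide data frame and its values are hypothetical.

```r
library(tidyr)

# Hypothetical two-game table in the same wide layout as the sales columns
wide <- data.frame(
  Name     = c("Game A", "Game B"),
  NA_Sales = c(1.0, 0.5),
  JP_Sales = c(0.2, NA)
)

# One row per game/region pair; rows with missing sales are dropped,
# which mirrors na.rm = TRUE in gather()
long <- pivot_longer(
  wide,
  cols = NA_Sales:JP_Sales,
  names_to = "Region",
  values_to = "Sales",
  values_drop_na = TRUE
)
long
```

Note that the column range Global_Sales:Other_Sales used above also gathers Global_Sales, so the long table carries the worldwide total as if it were a region. Assuming Global_Sales is the sum of the regional columns, adding up Sales across all Region values would double count.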
Conduct some descriptive analysis on the data, figuring out:
* distributions of variables,
* variables that appear to be strongly related with each other (using appropriate methods to quantify the relationships based on whether variables are numerical or categorical).
The boxplot shows two extreme outliers. On inspection, both are Grand Theft Auto V (the PS3 and PS4 releases).
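# Overall distribution of global sales
hist(videogames.clean$Global_Sales, xlab = 'Global Sales (millions of USD)')
# Zoom in on the low end (0 to 0.5), using many narrow bins
hist(videogames.clean$Global_Sales,
     xlab = 'Global Sales (millions of USD)',
     xlim = c(0, .5),
     breaks = 10000)
# The boxplot makes the extreme values easy to spot
boxplot(videogames.clean$Global_Sales, xlab = 'Global Sales (millions of USD)')
# Inspect the rows behind the extreme points (global sales above 17)
videogames.clean[which(videogames.clean$Global_Sales > 17), ]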
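The cutoff of 17 above was chosen by eye from the boxplot. As an alternative sketch, boxplot.stats() returns the points that a boxplot draws beyond its whiskers (by default, more than 1.5 times the interquartile range past the hinges). The sales vector below is made up for illustration.

```r
# Hypothetical sales figures (millions); the last two values are deliberately extreme
sales <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 19.4, 20.3)

# Points beyond the whiskers, i.e. what the boxplot draws as individual dots
flagged <- boxplot.stats(sales)$out
flagged

# Positions of the flagged values, usable to look up the matching rows
flagged_idx <- which(sales %in% flagged)
flagged_idx
```

On strongly right-skewed data such as Global_Sales, this rule will likely flag far more games than the two visible extremes, so a hand-picked cutoff can still be the more practical choice.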
vs_sales.byregion.byyear <- vs_byregion %>% group_by(Year, Region) %>% summarize(Sales = sum(Sales))
vs_sales.byregion.byyear %>% ggplot(aes(x = Year, y = Sales)) +
  geom_line(aes(color = Region))
# Histogram of critic scores, ignoring games without a score
fig <- plot_ly(x = videogames.clean$Critic_Score[which(!is.na(videogames.clean$Critic_Score))],
               type = "histogram")
fig
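For the second part of the task, quantifying how strongly variables are related, the choice of method depends on the variable types. The sketch below shows one possible choice per case on made-up data: Spearman rank correlation for two numerical variables, Cramér's V (derived from the chi-squared statistic) for two categorical variables, and a Kruskal-Wallis test for a numerical variable across categories. The toy data frame and its column names are hypothetical stand-ins for columns such as Critic_Score, Global_Sales, Genre and ESRB_Rating; these methods are assumptions about what is "appropriate", not the only options.

```r
# Hypothetical toy data standing in for the real columns
toy <- data.frame(
  critic = c(9.1, 8.5, 7.0, 6.2, 5.5, 8.8, 4.0, 7.7),
  sales  = c(5.2, 3.1, 0.9, 0.4, 0.6, 4.0, 0.1, 1.5),
  genre  = c("Action", "Action", "Sports", "Sports",
             "Puzzle", "Action", "Puzzle", "Sports"),
  rating = c("M", "M", "E", "E", "E", "M", "E", "M")
)

# Numerical vs numerical: rank correlation, less sensitive to skewed sales
rho <- cor(toy$critic, toy$sales, method = "spearman", use = "complete.obs")

# Categorical vs categorical: chi-squared statistic rescaled to Cramér's V (0 to 1)
tab  <- table(toy$genre, toy$rating)
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
cramers_v <- sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))

# Numerical vs categorical: do sales differ across genres?
kw <- kruskal.test(sales ~ genre, data = toy)

rho
cramers_v
kw$p.value
```

With only eight rows the chi-squared approximation is unreliable (hence the suppressed warning); on the full dataset the same calls would apply to the real columns, with missing scores handled by use = "complete.obs".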